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Abstract 

High-level formalisms such as stochastic Petri nets can be used to model complex 
systems. Analysis of logical and numerical properties of these models often requires 
the generation and storage of the entire underlying state space. This imposes practical 
limitations on the types of systems which can be modeled. Because of the vast amount 
of memory consumed, we investigate distributed algorithms for the generation of state 
space graphs. The distributed construction allows us to take advantage of the combined 
memory readily available on a network of workstations. The key technical problem is 
to find effective methods for on-the-fly partitioning, so that the state space is evenly 
distributed among processors. In this paper we report on the implementation of a 
distributed state-space generator that may be linked to a number of existing system 
modeling tools. We discuss partitioning strategies in the context of Petri net models, 
and report on performance observed on a network of workstations, as well as on a 
distributed memory multi-computer. 
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1 Introduction 


Discrete-state models are a valuable tool in the representation, design, and analysis of com- 
puter and communication systems, both hardware and software. We are particularly inter- 
ested in stochastic formalisms, where some probability distribution is associated with the 
possible events in each state, so that the model implicitly defines a stochastic process. These 
are then used to carry on performance, reliability, or performability studies. 

Most real systems, however, exhibit complex behaviors which cannot be captured by 
simple models having a small or regular state space. Given the high expressive power of 
formalisms such as Petri nets [21, 20], queuing networks, state charts [15], and ad hoc 
textual languages [12], the correct logical behavior can, in principle, be modeled exactly. The 
timing behavior is then defined by associating either an arbitrary probability distribution 
to the duration of each activity (resulting in a stochastic process usually solved by discrete- 
event simulation), or an exponential or geometric distribution (resulting in a continuous-time 
Markov chain (CTMC) or a discrete-time Markov chain (DTMC), respectively). 

We focus on the CTMC case, where, with the exception of very special circumstances, 
such as the existence of product-form solutions [3, 23], or of extensive symmetries [5, 11], 
the numerical solution requires the generation and storage of the entire state space. This is 
the main drawback of the numerical approach, since the size of the state space can easily 
be orders of magnitude larger than what can be stored in the main memory of a single 
workstation. 

An important aspect of the problem is that the amount of memory needed for the state- 
space is much larger than that needed for the numerical solution. This differential arises 
because, during generation, states (typically vectors of integers) must be represented whereas, 
during the numerical solution, integer- value codes for states suffice. Another important 
aspect is that the computational time spent generating a state-space is of the same order as 
the time spent solving it. Because of these facets, it is conceivable to generate a large model’s 
state-space on a handful of processors, then transfer the encoded state-space to a single 
processor for numerical solution. This technique increases by an order of magnitude the size 
of models that can be solved numerically. It is an approach suitable for multiple workstations 
on a local- area network (LAN). The ubiquitous presence of LANs makes this approach very 
palatable, as it effectively offers a much larger overall amount of memory and computational 
power to the analyst, without requiring the purchase of new hardware. However, one should 
note that the serial solution phase will remain a bottleneck until parallelized. The utility 
of the approach is for logical analyses that can also be parallelized (we discuss instances of 
these), and for capability of solving models too large to tackle on a single workstation. 

Section 2 presents the interface used to integrate an existing modeling tool with our 
distributed algorithm. Sections 3 and 4 discuss sequential and distributed state space explo- 
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ration, respectively. Section 5 presents two analysis algorithms that can be applied to the 
state space generated in a distributed fashion. 

Our approach is not tied to a particular formalism. This greatly simplifies the paral- 
lelization of any state-space-based modeling tool. In particular, we have, for now, applied 
the approach to the tool SPNP [8], and report the performance results in Section 6. Section 
7 summarizes our work and discusses our plans for further investigation. 

2 A general interface to a distributed engine 

Our goal is to provide a “distributed analysis engine” which can be connected to any 
“discrete-state formalism front-end”. Hence, the engine implementation must not depend 
on the type of formalism described by the front-end. While our data are obtained by using 
a stochastic Petri net front-end, nothing in the engine reflects this. Indeed, we are able to 
integrate the engine we describe with a commercial modeling tool, BONeS Designer. Our 
present tool (with its capability for stationary analysis) may serve as a substitute for the 
transient analysis engine we’ve also integrated into Designer [18]. 

In general, we can say that the reachability set S, the set of states reachable from a 
given initial state s 0 , is a subset of some structured countable set, often IN” for some n. The 
reachability graph (S', A) is a directed labeled graph whose nodes and arcs are the reachable 
states and the possible state-to-state transitions, respectively. Each arc is labeled with the 
identity e of an “event”: s-As' means that event e causes a change of state from s to s'. If 
two events e\ and e 2 can cause the same change of state, they correspond to distinct arcs in 
A. The model defines which events are enabled, i.e., can occur, in each state s, among the 
set of possible events E. 

A model can then be considered as a way to define a set of functions which define the 
interface between the engine and the front-end. Besides reflecting good software engineering 
practice, this approach highly facilitates the integration of the engine to new front-ends. The 
state space S is implicitly described by the following functions: 

• Initial , with no input parameters, returning the initial state s 0 for the model. 

• Enabled(s), returning the (finite) set of events enabled in state s £ S. 

• NewState(s } e), returning the state reached from state s £ S when event e £ Enabled(s) 
occurs. 

• Comparers-!, s 2 ), returning the result of the comparison between two states, SMALLER , 
EQUAL , or LARGER. This function prevents the engine from having to know the 
structure of the state, yet it allows to perform an efficient search for a given state in a 
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large set of states (for example using a binary search). The only assumption is that a 
total order can be defined over the set of reachable states. 

To define the stochastic behavior, additional functions are needed, depending on the 
type of stochastic process underlying the model. The simplest case is when all events have 
exponentially distributed durations, resulting in an underlying CTMC. Then, we only need 
a function 

• Rate(s,e), returning the rate at which event e £ Enabled(s) occurs in state s £ S in 
isolation. 

In many models, however, it is useful to describe “instantaneous events” which can occur 
in zero time, as soon as they become enabled. In GSPNs [1], this is achieved by immediate 
transitions; in queuing networks, by passive resources. If a state enables an instantaneous 
event, “timed events” cannot occur, they are de facto disabled. We disallow infinite sequences 
of instantaneous events; while these subtle situations can be managed [6, 13], they usually 
indicate modeling errors. Then, the state space S can be partitioned into two classes, of 
“timed” and “instantaneous” states: S T and S' 1 , where s £ S 1 iff it enables instantaneous 
events. The additional functions needed to describe this class of models are: 

• Timed(s), returning TRUE or FALSE according to whether s is timed or not. 

• Weight(s,e), returning the weight, a nonnegative real number, for the occurrence of 
the instantaneous event e in the (instantaneous) state s. The probability of event e in 
s is then obtained by normalization: 

Weight(s,e) 

Rrob{s,e ) = , — - — - 

J2 e 'eEnabied{s) W eighths , U) 

This definition allows for some, but not all, of the enabled instantaneous events to have 
zero weight in a given state s. 

In certain existing tools, this interface is inadequate because it might be much more 
efficient to compute the rates or probabilities for all enabled events in a given state s with 
a single function call. We ignore this aspect for readability’s sake, but we observe that the 
algorithms we present are not affected in any substantial way by this choice. 

Finally, a model is used to study some quantity of interest. We assume this to mean the 
expected value of a stochastic reward process at some point in time, or in steady state. This 
process is defined by means of a reward rate function p defined over the state space: given 
a state s £ S', a given reward p(s) is gained for each unit of time the model is in state s. 
Hence, for example, the expected steady-state reward rate of the model is 

Z)/ , ( s ) 7r «, 

s£S 
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where 7 r is the steady-state probability vector. If we compute the expected cumulative time 
spent in each state during the interval (ti,t 2 ], and we use this vector instead of 7r, 

the previous sum will result in the expected cumulative reward gained during this interval, 
and so on. In practice, multiple reward rate functions are specified, so we define 

• Reward(s } k) } returning the value of the k - th reward rate function evaluated in state 

s. 


3 State space exploration 

In many studies, the logical analysis of the model is of interest in itself. For example, we 
might want to explore qualitative properties such as absence of deadlocks and livelocks, 
reachability (possibility of reaching states satisfying certain conditions), liveness, and so 
on. If the distributions of the durations of the timed activities have unbounded support 
(e.g., a geometric or exponential distribution), the timing and probabilistic behavior can be 
ignored, except for the restriction that timed events must be considered disabled whenever 
an instantaneous event is enabled. 

The state of the model is represented as a structured quantity, often of fixed size. For 
example, the state of Petri net is given by the number of tokens in each place (which could 
be stored as a fixed-size vector of nonnegative integers), the state of a multiclass queuing 
network is given by the number of customers of each class in each queue, and so on. 

More complex storage schemes might be devised to save storage, often based on the 
existence of model invariants. For example, in a closed queuing network or in a Petri net 
covered by P-invariants [17], the customer populations at each queue, or the token population 
at each place, satisfy certain linear relationships. Sparse storage techniques can also be used 
to store a state, and it is possible to store an integer in just [log k~\ bits, if an upper bound k 
on its value is known (again invariants can be used for this purpose) [2], We do not discuss 
these techniques here, since they are independent of our method and apply equally well to 
both sequential and distributed analysis. 

Since S and (S', A) are defined only implicitly by the model, their size and characteristics 
might not be known a priori: 

• S might be finite or infinite. We assume that it is finite, but its size is normally very 
large and known only at the end of the exploration. 

• (S', A) might be strongly connected, or it might have a node s; which reaches a node 
Sj, but is not reachable from it. We then say that node s; is transient, since there is a 
positive probability that the model will never enter s; again after leaving it. 
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Procedure ExploreS equential 

1. S t — {hnzfzn/}; S new < — S'; A * — 0; 

2. while 3 s £ S new do 

d* $new ^ $new \ 

4. for each e £ Enabled(s) do 

5. s new <— NewState(s } e); 

6. if s neu , ^ S' then 

f* $new * $new 0 

8. S <— S U {s neu ,}; 

9. end if; 

10. A <— A U {s^s neu ,}; 

11. end for; 

12. end while; 

Figure 1: Sequential state space exploration 

In certain models, the existence of transient states is an indication of a problem, either 
in the model, or in the system being modeled. S contains a transient state if and only 
if s 0 is transient, so it is sufficient to test for this condition (see Section 5). 

• Some system characteristics of interest might be studied by considering the reachability 
set alone. For example, we could define a complex condition which we hope never 
arises in the model (a circular wait in the model of a tasking system, or an inconsistent 
status in the model of a communication protocol). This condition can be expressed 
as a function c : S — > {0, 1}, where 0 corresponds to “good” states not satisfying the 
condition, and 1 corresponds to “bad” states. For any reachable state s £ S, we can 
then check the value of c(s). Clearly, this analysis requires only to store the reachability 
set S (strictly speaking, it requires to enumerate it, but, if (S', A) contains cycles, the 
graph search requires in general to store all previously explored states, to avoid being 
caught into an infinite loop). Other system characteristics might require to store the 
reachability graph (S', A). 

The sequential algorithm for state space exploration is shown in Fig. 1. If S new is stored 
using a single-link list managed as a FIFO queue, the reachability graph is explored in 
breadth-first order. Each element in the list contains, either directly or by pointing to it, a 
different state. If A is not needed, the statements referring to it in Fig. 1 can be omitted. 
Note that the pseudo-code assumes that S is finite; if not, the algorithm will not halt. 
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3.1 Generation of the stochastic process 


When the questions asked of the model refer to the timing and stochastic behavior, a stochas- 
tic process (not just a reachability graph) must be built as a result of the state space explo- 
ration. If the underlying process is a CTMC, this means building an infinitesimal generator 
matrix Q, where Q SyS > is the rate of going from state s £ S T to state s' £ S T , for s ^ s'. The 
diagonal entries of Q are defined to be the negative of the sum of the off-diagonal entries on 
the corresponding rows, Q SjS = —J2 s 'es T ,s'^sQs,s ’ ; in our discussion, we assume that they 
are stored explicitly only during the CTMC solution. It is usually more efficient to use an 
alternative exploration algorithm, which stores only the timed states T. Analogously, only 
a “reduced reachability graph”, basically equivalent to Q, needs to be stored. Whenever an 
instantaneous state is found, a depth-first search is initiated, to determine the set of timed 
states reached. If 


eo ei e 2 e /t-l e k , 

s — rSi — 2 — ^ • • • — ^ 


where s, s' £ S T and si, s 2 , • • • , Sk £ S 1 , the rate of going from s to s' along this path is 

k 

Raters, e 0 ) • Prob(si , e 8 ) 

i = l 

and Q s>s i is the sum of these rates over all possible paths of this type from s to s', including 
paths of length one, that is, paths with no intermediate immediate states. 

This is illustrated in Fig. 2. Procedure BuildQ is called first, which in turn uses the 
recursive procedure GenerateTransitions. For simplicity, we assume that S T , S^ ew , S^ ew , 
and Q are global variables and that the initial state is timed (the algorithm can be easily 
modified if this is not the case). S^ ew is used as a stack, and is needed to recognize cycles 
of instantaneous events, which we consider illegal. When the execution returns to BuildQ , 
S^ ew is empty, that is, instantaneous states are stored only temporarily. Note that entries 
of Q are incremented, not set, by statement 6 in procedure GenerateTransitions. This is 
because multiple paths of instantaneous states might exist between the same source and 
destination. 


4 Distributed state space exploration 

Like the sequential algorithm, the distributed algorithm shown in Fig. 3 performs a 
breadth-first exploration of the state space. Each state reached is either explored locally or 
sent to another process. For the distributed algorithm, we then define another function in 
the interface: 

• Partition's, N), returning the identity of the process to which state s is assigned, an 
integer between 0 and N — 1. 
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Procedure BuildQ 

1. S T <- {Initial}] Sf ew <- {JrnAa/}; A 7 ^ <- 0; Q <- 0; 

2. while 3 s £ do 

3- ‘‘hfeio ‘‘hfeio \ I 5 }! 

4. for each e £ Enabled(s) do 

5. s neu , <— NewState(s, e); 

6. GenerateT ransitionsfs , Rate(s, e), s neu ,); 

7. end for; 

8. end while; 


Procedure GenerateT ransitionsit , r, t new ) 

1. if Timed(t new ) then 

2. if t new (f S T then 

3- * ^raeto ^ {fraeto}) 

4. S' 7, <— S' 7, U {fnetu}; 

5. end if; 

6 - Qt,t new Qt,tnew + T] 

7. elsif A eu , ^ A 7 ett , then 

8- ^neio < ^netu ^ {lnew}i 

9. for each e £ Enabled(t new ) do 

10. GenerateTransitionsft, r ■ Prob(t neW} e), NewState(t new , e)); 

11. end for; 

12- f? 7 ^ <— A 7 eu , \ 

13. else 

14. error(“cj/c/e of instantaneous events ”); 

15. end if; 


Figure 2: Sequential generation of Q 

Assuming we have N processes running on N processors, this function partitions the state 
space into N classes, one assigned to each process(or), and is a critical factor affecting the 
performance of the distributed algorithm (see Section 6.1). 

The incidence matrix of the reachability graph is stored in column- wise format. Hence, 
when process i determines that state s new “belongs” to a remote process j ^ z, it sends both 
s new and the arc leading to it, s-Gs new , to j. As the representation of a state can be quite 
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Procedure ExploreDistributedi 
1. if Partition(Initial) = i then 


2 . 

S l <— {Initial}-, 

3. else 


4. 

S' 8 ' 0; 

5. end if; 

fi S'* 

yj ' ^ next 

, <- S' 8 ; A 8 <- 0; 

7. while “not received terminate message ’ 

8 . 

while 3s G S l new do 

9. 

S'* <— S'* \ Jot- 

u new u new \ l J ) 

10 . 

for each e £ Enabled(s) do 

11 . 

s new NewState(s, e); 

12 . 

j = Partition(s new , N); 

13. 

if j 7 ^ 7 then 

14. 

SendState(j, s new ); 

15. 

SendArc(j, s-A Snew ) 

16. 

else 

17. 

if s new (jt S 8 then 

18. 

S'* S'* II 

19. 

S ' 8 S ' 8 u {s neu 

20 . 

end if; 

21 . 

A 8 A 8 U {s^s neu , 

22 . 

end if; 

23. 

end for; 

24. 

end while; 

25. 

S new <- u ReceiveStates ; 

26. 

A 8 <— A 8 U Receive Arcs ; 

27. end while; 


J; 


}; 


}; 


Figure 3: Distributed state space exploration using N processes 

large, states are assigned an index. Locally (i.e., in process z), state s £ S l is identified by 
the index k, if s is the k- th state added to S\ Globally (i.e., in process j ^ z), state s is 
identified by a two-component index: (k,i). State indices, rather than actual states, are 
stored and exchanged whenever possible. 
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When i sends state s new to j, with the function call SendState(j, s new ), the local index of 
s new in j is not known to z, so the actual state must be sent. However, when the arc s-As new 
is sent, with the function call SendArc(j, s-As new ), the global state index of s is sent, since 
it is known to z. Also, the destination (actual) state s new does not have to be sent a second 
time to describe the arc, since states and arcs are always sent in pairs. 

The functions ReceiveStates and ReceiveArcs return all the states and arcs sent to 
process z since the last time they were called, respectively. Each state x in the set returned 
by ReceiveStates has a corresponding arc (ki,j)-Ax in the set returned by ReceiveArcs. 
The set S l is then searched. If x £ S\ its previously assigned index is retrieved, otherwise 
x is added to S l and a new index is generated for x. If k 2 is the local index of x, an arc 
(&i, j)—^(k 2 , z) is added to A\ These are high-level descriptions; the actual implementation 
details for the communication mechanism are beyond the scope of this presentation. 

In the sequential algorithm, the choice between storing the incidence matrix of the reach- 
ability graph in row- wise or column- wise format is irrelevant. For the distributed version, 
however, a row-wise format would require a more complex protocol. With row-wise storage, 
the entry s 1 Ss 2 is stored by process i as (£q, i)-A(k 2 , j), if Si is assigned index (£q,z) and s 2 
is assigned index ( k 2 ,j ). However, i does not know the local index k 2 of s 2 , so it must send 
s 2 to j, and wait for ( k 2 ,j ) in return. Only then i can complete the storage of the entry in 
the incidence matrix. With the column-wise format we use, i simply sends the pair s 2 and 
(&i, z)-As 2 to j, without having to wait for further information from j, since it is up to j to 
fill-in the value for the arc destination. 

The communication complexity of the distributed state-space algorithm is then one (ac- 
tual) state, one state index, and one event, for each “cross-arc” (an arc from a state in S l to 
a state in S\ i ^ j) . 

When process i finishes exploring its local states {S l new is empty), it waits for more 
states and arcs from other processes. When all processes have finished their local work and 
are waiting to receive a message, the distributed state space exploration has completed. 
Detecting termination is a well-known problem with many solutions. In the workstation 
network data we present, we used the circulating probe algorithm described by Dijkstra et 
al. [10]. We have since made the engine portable by using MPI [14] as the communication 
mechanism, and in that context employ the scalable “Non-committal barrier” described by 
Nicol [19]. 

4.1 Distributed generation of the stochastic process 

For brevity’s sake, we do not present the pseudo-code for the distributed generation of 
the underlying stochastic process, obtained by merging the algorithm for the distributed 
generation of the state space of Fig. 3 with the elimination of the immediate states used in 
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the algorithm of Fig. 2. Only timed states are assigned to a particular process using the 
partition function. Immediate states are managed in the process that generates them, and 
then discarded after all the timed states reachable from them have been explored. Storing 
the immediate states together with the timed ones is a reasonable alternative (see [6, 4] for 
the tradeoffs involved in storing these states permanently), but is probably less appropriate 
if the paramount goal is to minimize storage requirements. 

At the end, process i contains the states S T,t = {s £ S T : Partition's, N) = i} and the 
entries of Q corresponding to arcs reaching these states, Q, s t,i. 

4.2 Implementation issues 

Before concluding this section, we discuss a few implementation issues. Communication 
between processes is accomplished through message passing. Reliable message passing is 
provided by acknowledged messages: a sender does not continue until the receipt of its 
message has been acknowledged. Since the receipt of a message generates a signal, the 
receiver can acknowledge the message almost immediately, thus minimizing the waiting time 
for the sender. 

Because of the potentially high number of states sent to another process, each state/arc 
pair is buffered in the sender. The buffer size is a compilation parameter. As the size 
increases, more states and arcs can fit into a single message, and fewer, although larger, 
messages are exchanged. This reduces one type of overhead, but it also increases the like- 
lihood that a process j remains idle waiting for states to be imported, while some other 
process i delays sending states that j should explore because the buffer is not full. 

5 Distributed analysis of the model 

Once the state space is built, analysis can proceed. We present two types of state-space- 
based analysis. The first type analyzes logical properties of the state-space; we point out 
a number of questions that can be answered in a distributed fashion using the distributed 
state-space. The second type is numerical analysis; we presently perform this sequentially, 
but point out some unexpected ramifications of our distributed generation of the state-space. 
In the following, we use the following symbols: 

• n and rj are the total number of tangible states and arcs stored; n = |A T | is the 
dimension of Q and r] is the number of nonzero entries in Q: rj = |{(si,s 2 ) : Q Sl ,s 2 > 0} | . 

• n l and rf are the analogous quantities for process i in the distributed algorithm: n l = 

|A T ’*| and rf = |{(si,s 2 ) : Qs 1 ,s 2 > 0,s 2 £ Furthermore, we define r] J,t to be 

the number of entries stored by i corresponding to events originating from states in j: 
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T] 3 ’ 1 = |{(si,s 2 ) : Q S1 ,S 2 > 0,5i £ S T ’ 3 A s 2 £ S' T ’*}|. Hence, E^Io* A = n, E^Io* = V , 
and EfEo 1 V 3 ’ 1 = rf ■ 

5.1 Distributed detection of logical state properties 

The notion of reachability pervades logical analysis of state-spaces: “is it possible to reach 
some state si from another state s 0 T\ As we will see, solution to this problem permits one 
to address higher-level questions. For instance, to determine whether there are any transient 
states, it is sufficient to test whether the initial state is transient, that is, whether there exists 
a state s £ S T which does not reach sq. This is equivalent to determining whether there 
is a state s unreachable from s 0 in the “reverse reachability graph”, obtained by reversing 
the direction of all arcs. A simple breadth-first search algorithm that identifies all states 
reachable from s 0 in the reverse graph can be used for this purpose — if any state remains 
untouched, s 0 is transient. An efficient implementation requires a row- wise storage of the 
incidence matrix of the reverse graph, and this is a further reason to use a column- wise format 
for the storage of the incidence matrix of the original graph, since one is the transpose of 
the other. Note that the state-space generation process is itself a breadth-first search, and 
that the algorithm for testing whether s 0 is transient is essentially the same. 

A sequential breadth-first search algorithm requires 0(r]) operations. A distributed im- 
plementation requires the same number of operations, but also O(x) communication, where 

X = E A 

is the number of cross-arcs. 

The communication cost is then of the same order of complexity as for the state space 
generation, although now only state indices, not the actual states, are sent between processes. 

Many other important questions about the behavior of a system can be answered in a 
distributed way using the information in the reachability set and graph: 

• Reachability: a condition c is reachable if there is a state s £ S satisfying c. Each 
process i can simply test for this every time it adds a new state to S\ that is, this 
question can be answered without further examining the reachability graph. Deadlocks 
are just a special case: a deadlock is an absorbing state, that is, a state which does 
not enable any event. 

• Livelock: a livelock is a set of states L, 1 < \L\ < |5j such that, once L is entered, no 
state in L — S can be reached. Formally, the reachability graph must contain a strongly 
connected component with two or more nodes and no outgoing arcs (no way to leave 
the component). This is equivalent to testing whether the initial state is transient 
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in a modified reachability graph without absorbing states (these can be easily tagged 
during state-space generation). 

• Liveness: an event e is not live if there is a state s £ S such that, once s is entered, e 
can never become enabled. If we define S e to be the set of states where e is enabled, 
liveness can be established by checking that every state in S can reach a state in 
S e . The same breadth-first algorithm used to determine whether the initial state is 
transient can be adopted; the only difference is that the search in the reverse graph 
proceeds from the set of states S e , not from s 0 alone. 

• Conditional reachability: we are sometimes interested in determining whether there 
exist two states si, satisfying condition ci, and s 2 satisfying condition c 2 , such that 
Si reaches s 2 . If the entire graph is strongly connected, this is equivalent to asking 
whether conditions c\ and c 2 can be satisfied. Otherwise, a modified version of dis- 
tributed breadth-first search algorithm used to determine whether s 0 is transient can 
be employed. Instead of starting from s 0 , we start from a set of states S 2 = {s : c 2 is 
satisfied in s}, and we determine the set of states R reachable from S 2 in the reverse 
graph. Our goal is to find a state in R satisfying Si. 

5.2 Numerical solution of the underlying stochastic process 

In the current implementation, the numerical solution of the CTMC is centralized. While this 
prevents us from obtaining good speedups, it is important to remember that our immediate 
goal is to increase the size of models we can solve. As pointed out earlier, the large difference 
in memory requirements for the generation and solution phases means we can solve models on 
one processor that are an order of magnitude larger than we can generate on one processor. 

After building Q and testing for a transient initial state, each process sends its portion 
of Q to a solver process where the numerical solution is performed. This is reasonable given 
our current target level of parallelism, up to a dozen workstations. 

The “solution” sought for the CTMC is normally the steady-state probability vector 
7 r satisfying ttQ = 0, if Q is ergodic, or the sojourn times in each transient state until 
absorption, or the transient instantaneous or cumulative probability in each state. In any 
case, the solution is given by a real vector v of size n. 

Vector v is generally not all that one wishes to know about the model. Rather, v is 
used to compute the expected “reward” earned by the model. Rewards are a function of 
individual states; to compute the reward for state s one must generally have available the 
full state representation of s. Consequently, after computing v serially, we distribute it back 
to the processors holding the full state-space description. Recall now that the compact 
representation of a state identifies the processor that owns it. It is straightforward then to 
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return each component u* to process z, for z £ {0, . . . , N — 1}. It is then possible to compute 
the expected value of a measure, to = J2 s es T Reward(s) ■ v s in a distributed way. Process 
z computes to* = J2 s es T ’' Reward(s) ■ v S} and a master process combines these subresults as 

™ = Eilh 1 m% - 

Note that a number of measures may be computed simultaneously simply by using differ- 
ent reward functions. Since the number of requested measures can be quite large in practice, 
their distributed computation can result in a substantial time saving. 

An interesting observation should be made at this point. It is well known that the 
ordering of the variables (states) can affect the speed of convergence for iterative methods 
such as Gauss-Seidel and Successive-Over- Relaxation (SOR) [24, 22], We indeed experienced 
this phenomenon when studying the number of iterations required by a sequential SOR 
implementation. In our first implementation, all the states in S T,t were ordered before those 
in S T,t+1 , i G {0,1,..., N — 1}, while the sequential state-space exploration results in a 
breadth-first order, starting from s 0 . The ordering from the distributed implementation 
regularly required more iterations, even if the SOR implementation was exactly the same. 

We conclude that the natural breadth-first order by which states are generated and 
indexed in the sequential implementation is a better choice. To verify this, we sorted the 
states in the solver process according to a breadth-first order: if the distance from s 0 to s; is 
less than the distance from s 0 to Sj, Si is assigned an index smaller than s r This does not 
necessarily achieve exactly the same order as in the sequential implementation (since multiple 
total orders are compatible with the above partial order), but it does result in approximately 
the same number of iterations in the two implementations. We believe that state ordering 
will become an issue in a distributed implementation, where it requires reshuffling states, 
and the corresponding columns of Q, among the N processes. 

The partition heuristic might affect the convergence of a distributed solution in other 
ways as well. We have not yet considered these aspects in detail, but it is clear that the 
actual numerical values of the rates of the state-to-state transitions, rather than the mere 
existence or absence of an arc, will need to be taken into account in this case. 

For example, the idea of decomposability [9] is based on finding a block partition of the 
transition matrix where the entries of the off-diagonal blocks are orders of magnitude smaller 
than those in the diagonal blocks. This ensures that, after entering a block, the stochastic 
process reaches an “approximate steady-state” before moving to a different block. If the 
partition heuristic is such that each class corresponds to one or more blocks having this 
property, then several attractive iterative methods will be appropriate, since most of the 
iterations will occur within a single class, while only a few global iterations requiring the 
exchange of data across processors will be needed. A good candidate for this application is 
the multi-level algorithm developed by Horton and Leutenegger for the numerical solution 
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Figure 4: The FMS stochastic Petri net. 


of Markov chains [16]. 

6 Results 

We consider the model of a flexible manufacturing system (FMS) shown in Fig. 4. We 
omit a description of this SPN, since we are focusing on a comparison of the sequential and 
distributed algorithms for its analysis. The interested reader can consult [7] for a detailed 
presentation of its behavior and the meaning of its places and transitions. For this discussion, 
it is sufficient to observe that, as the number k of initial tokens in the three places PI, P 2, 
and P 3 increases, the number of states n and arcs r/ increases sharply (see Table 1). The 
first partitioning function used (to be described) assigns states to processors depending on 
the markings in places PI, P 2, P 3. 

The largest size listed ( k = 6) exceeded the storage capacity of a single workstation, 
yet was solvable (in approximately 40 minutes) by generating the state-space using five 
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Figure 5: Speedup for generation alone (left) and for overall solution (right). 

processors, and solving it on one. To assess how well the generation process parallelizes, 
we consider speedup on problems small enough to solve on one processor. Fig. 5 shows the 
speedup obtained by running our distributed algorithm on a set of homogeneous workstations 
(SPARCstation 10 class) communicating over a lOMbs Ethernet, for various values of the 
initial number of tokens k. The left plots refer to the timing collected for the generation of 
the state space alone, while the right plots refer to the timing for the entire solution process, 
assuming a single measure: the expected total number of tokens in PI, P 2, or P 3, in steady 
state. As previously noted, computing a single measure results in the lowest possible overall 
speedup since the computation of the measures can be almost perfectly parallelized. 

The speedup with N processes is obtained by dividing the runtime of the sequential 
solution (N = 1) by the runtime of the distributed solution with N processes (defined as the 
maximum processor runtime). The runtime of a process includes the time spent executing 
user or system instructions and the time spent communicating or waiting for states to be 
imported, if A" > 1. Speedups are calculated only for model problems where the serial 
solution ran without paging (other than to load, of course). Speedup of the generation phase 
thus measures the relative cost of the communication overhead during that phase. 

Since our motivation is exploiting distributed memory, the case of highest interest is that 
of k = 5, the largest problem. Here we see evidence of a favorable computation to communi- 
cation balance (as well as good load balance), since speedup increases almost linearly in the 
number of processors. The complete serial solution required nearly an hour of computation. 
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Figure 6: Speedup on the SP-2 (k = 5). 


k 

n 

V 

1 

54 

155 

2 

810 

3,699 

3 

6,520 

37,394 

4 

35,910 

237,120 

5 

152,712 

1,111,482 

6 

537,768 

4,205,670 


Table 1: Size of the tangible reachability set and graph as a function of k. 

We also ported our distributed engine to an IBM SP-2 multiprocessor. Fig. 6 shows 
the speedup for the state-space generation on the same FMS SPN, for the case k = 5, as 
a function of N. We achieved a speedup of 11.35 when N = 16. The serial state-space 
generation requires 14 minutes on a single processor. We also experimented with larger 
values of k, and various machine sizes. With k = 6 there are 537,768 states. The time 
needed to generate the state-space varied from three hours using 2 SP-2 processors (the 
smallest configuration able to solve the problem), to 18 minutes using 32 processors. Using 
32 processors we were able to generate the k = 7 state-space (1, 639, 440 states) in 51 minutes, 
and the k = 8 state-space (4,459,455 states) in four hours. 

The conclusion we may draw from this data is that the algorithm works well, and makes 
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Figure 7: An extreme case for the distributed algorithm. 

possible the generation of state-spaces that are much larger than those usually considered 
tractable. 

6.1 Choosing a good partition function 

Choice of a good partition function is critical. It must provide both locality (if possible), 
and balance. Locality means that, in general, most of a state’s descendents are assigned 
to the same processor as is the parent state. Locality reduces communication overhead. 
Spatial balance means that each processor is assigned approximately the same number of 
states; contrast this with temporal balance which additionally calls for each processor to be 
busy most of the time. Spatial balance is sufficient if problem-solving capacity is a concern, 
whereas temporal balance is required to achieve good speedups. 

The modeling paradigm may provide clues to locality. The SPN reported in this paper is 
a good example. A Petri net state is a vector enumerating the tokens in each place. When a 
transition fires, typically only a small number of components of this vector change. Thus, if 
we base a partitioning rule on the markings of a small fixed set of places (called a control -set), 
all transition brings that do not involve control set places reflect state transitions that are 
contained entirely within a processor. To achieve spatial balance we first need to choose the 
control set so that the range of combinations of markings in its places is large. Unfortunately, 
in the worst case- one needs to generate the state-space to discover just what that range is; 
we must therefore rely upon the user’s intuition about the model to provide this property. 
Lastly, given a wide spread of markings in the control set, we .assign a state to a processor 
by applying a hashing function to the marking of its control set. 

To further illustration the difference between spatial and temporal balance, consider the 
extreme case of Fig. 7, where only one arc, t;— S( i+1 ) mo( jjVi connects S T,t to gTb+pmodiV^ 
and s 0 G S T, °. If state i; is the last one examined in each S T, \ the distributed algorithm 
will run sequentially, even if the states might be evenly allocated onto the N processes, 
and l liq number of cross arcs is certainly minimum: \ = 6. On the other hand, even in 
this unfortunate situation, the distributed algorithm would still have an advantage over the 
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sequential one, since communication overhead would be negligible, and the entire amount of 
memory available on N processors would be available. 

For the problem whose performance we studied, {P1,P2,P3} is the control set. The 
hashing function is 

(#pi + q ■ #P 2 + q 2 • #ps) mod N 

where q is a prime number (1013, in our case) and indicates the number of tokens in 
place p (in a given marking). The bar chart on the top right in Fig. 8 shows the distribution 
of states using this partitioning function with N = 6 processes. 

A good tool to decide the quality of the partition function is then the matrix of the 
numbers of edges cut by the partition, rf’ 3 . Fig. 8 describes these values for four different 
choices of the partition function. The parameters considered are six processes (N = 6) and 
five tokens initially in PI, P 2, and P 3 ( k = 5), resulting in 152,712 states and 1,111,482 
arcs in the timed-to-timed state-space and graph, respectively. These values are obviously 
independent of the partition function chosen. 

A simple hashing on a few components of the state description achieves a reasonably 
uniform allocation of states to processes. Using all the components might not be a good idea 
for several reasons: 

• If linear invariants exist relating the values of the components, and if these values are 
simply summed, the partition function might not achieve a good “random” spread. 
For example, in a closed queuing network, the choice “sum of the number of customers 
in each queue mod iV” would allocate all the states to the same process, a bad choice. 
Petri nets also often exhibit this type of invariants. 

• Multiplying the components by some power of a large prime number, as we did, elimi- 
nates most problems due to the existence of invariants. However, in this case, if every 
component is factored in the computation of the partition function, most, if not all, 
arcs will be cross-arcs, because every event changes one or more components of the 
state. 

Hence, it is best to use only a few components of the state in the definition of the 
partition function. Any event which changes only the values of the other components is then 
guaranteed not to generate any cross-arcs (e.g., tpi 2 M 3 in Fig. 4, with the first partition 
function). 

If a particular structure is desired for the partition of Q, an appropriate partition function 
might be employed. The third choice in Fig. 8, 

{#PlwMl + #P1M1 + #P2ioM2 + #P2M2 + #P3M2) mod N 
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for example, results in a block-tridiagonal structure for Q, except for two blocks, in the 
upper-right and lower-left corners, due to the wrap-around nature of the modulo operator. 
This happens because we intentionally chose the control set so that any events can change 
at most one control set place marking, by at most value 1. Such a sparsity pattern might be 
very desirable, depending on the type of communication available between the workstations. 
If they were connected in a circular fashion with bidirectional links, they could potentially 
be all communicating at the same time, since each workstation only needs to exchange data 
with its two neighbors. 

We observe that the difference between the first and second partition function in Fig. 8 
is just in the multiplication by powers of 1013 in the first case, resulting in a 10% reduction 
in the number of cross arcs, while the state distribution is substantially similar. The price 
paid to obtain the tridiagonal structure with the third partition function is instead a higher 
number of cross arcs (20% more than for the first partition function). Interestingly enough, 
the second and third partition functions result in exactly the same state distribution. This 
is due to the existence of invariants, ensuring, for example, that the number of states where 
= m or ^pituMi + ^PiMi = m is exactly the same, and so on. Finally, the fourth function 
minimizes the number of cross arcs (fewer than a quarter of the arcs are cross arcs), and also 
achieves a good distribution of states. The clearly recognizable patterns in the matrix and 
in the state distributions are due to the independence of the rest of SPN from the number of 
tokens in P 3 and P3M2. There are ( 5+ 5 _1 ) = 21 ways to distribute k = 5 tokens over three 
places. For every such combination, exactly 7,272 combinations of tokens in the other places 
exist (7,272 X 21 = 152,712, the total number of states). Hence, the value of |S' T, *| is easily 
determined once we know how many of the 21 combinations correspond to process i. The 
hashing enforced by the expression (#P 3 + #P 3 M 2 - 1013) mod N results in three combinations 
assigned to processes 0, 2, and 4 (|S' T ’°| = |5' T,2 | = |5' T,4 | = 3 x 7,272 = 21,816), and four 
combinations assigned to processes 1, 3, and 5 (|5' T ’ 1 1 = |S' T,3 | = |5' T,5 | = 4x7,272 = 29,088). 

Clearly, much work remains to be done in this area, but the good news is that even just 
moderately informed choices, such as the first two in Fig. 8 still achieve our goals of locality, 
spatial balance, and temporal balance. 

7 Conclusion and future work 

We have demonstrated the feasibility of distributing the state-space generation phase of 
discrete state stochastic system analysis using only a small network of workstations. The 
approach exploits the memories of multiple workstations, allowing one to build and perform 
logical analyses of state-spaces too large for a single processor. We stress that our approach 
provides a distributed algorithm which is independent of the particular user-level formalism 
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adopted for the specification of the model. 

To improve the applicability and usefulness of our approach, two aspects need to explored 
further. First, a centralized numerical solution is appropriate when using up to a dozen 
or so workstations, or when a machine particularly suited for numerical computation and 
equipped with a substantial amount of memory is available. However, our approach is 
a natural candidate for a completely distributed implementation, so we intend to explore 
the rich area of distributed solutions of linear systems, and implement some of the most 
appropriate techniques. This becomes a necessity if we hope to scale up to a much larger 
number of workstations. 

Second, the efficiency of our approach is highly sensitive to the partition heuristics used. 
Hence, we plan to investigate algorithms to derive a “good” partition from an automatic 
structural analysis of the model, that is, before starting to generate the state-space. This is 
doubly important because the specification of the heuristics, in addition to being a critical 
factor, is a new burden put upon the user with respect to the sequential solution. 
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(#Pi + #P 2 • 1013 + #p 3 ■ 1013 2 ) mod N 
X = 565, 920 (50.9% of rj) 
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(#Pi + #P 2 + #P 3 ) mod N 
X = 613, 737 (55.2% of rj) 
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x = 677, 700 (61.0% of rj) 



0 

1 

2 

3 

4 

5 

0 

120,906 

13,536 

0 

7,272 

14,544 

7,272 

1 

0 

161,208 

9,396 

7,272 

0 

29,088 

2 

14,544 

0 

120,906 

20,808 

0 

7,272 

3 

0 

21,816 

0 

161,208 

9,396 

14,544 

4 

0 

7,272 

14,544 

0 

120,906 

20,808 

5 

9,396 

7,272 

0 

29,088 

0 

161,208 


(#P 3 + #P 3 M 2 • 1013) mod IV 
X = 265, 140 (23.9% of rj) 






Figure 8: Number of arcs from process i to process j, rf’ 3 ( N = 6 and k = 5). 
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